Today we're going to look at how to implement Q-learning!
The code is based on the notebook Q* Learning with FrozenLakev2.ipynb.
We'll use Colab as our platform and OpenAI Gym for the environment.
Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.
Excerpted from https://gym.openai.com/envs/FrozenLake-v0/
Simply put, the player wins by reaching the goal.
The map is 4×4.
S is the starting point, F is frozen surface that can be walked on, H is a hole (stepping on it means you die), and G is the goal (reaching it means you win).
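For reference, the default 4×4 map used by FrozenLake-v0 looks like this:

SFFF
FHFH
FFFH
HFFG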
The available actions are up, down, left, and right.
There are 16 cells in total, so there are 16 states.
Since every state has 4 possible actions, the size of the Q table is
$4 \times 4 \times 4 = 16 \times 4 = 64$
First, create a Q table filled entirely with zeros.
env.observation_space.n and env.action_space.n give the number of rows and columns of the Q table.
import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")
action_size = env.action_space.n
state_size = env.observation_space.n
# Create our Q table with state_size rows and action_size columns (16x4)
qtable = np.zeros((state_size, action_size))
total_episodes = 20000 # Total episodes
learning_rate = 0.8 # Learning rate
max_steps = 50 # Max steps per episode
gamma = 0.95 # Discounting rate
# Exploration parameters
epsilon = 1.0 # Exploration rate
max_epsilon = 1.0 # Exploration probability at start
min_epsilon = 0.01 # Minimum exploration probability
decay_rate = 0.005 # Exponential decay rate for exploration prob
# List of rewards
rewards = []
# 2 For life or until learning is stopped
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])
            #print(exp_exp_tradeoff, "action", action)
        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()
            #print("action random", action)
        # Take the action (a) and observe the outcome state (s') and reward (r)
        new_state, reward, done, info = env.step(action)
        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # np.max(qtable[new_state, :]) : the best Q value reachable from the new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        total_rewards += reward
        # Our new state is state
        state = new_state
        # If done (we fell into a hole or reached the goal) : finish episode
        if done:
            break
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    rewards.append(total_rewards)
print ("Score over time: " + str(sum(rewards)/total_episodes))
print(qtable)
Notice the action-selection part: when exp_exp_tradeoff is greater than epsilon, the action is chosen from the Q table (exploitation); otherwise a random action is taken (exploration).
The Q-table update is the essence of Q-learning! As a quick review:
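This is the standard Q-learning update rule, where $\alpha$ corresponds to learning_rate and $\gamma$ to gamma in the code:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$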
In addition, since less and less exploration is needed as training progresses, epsilon is gradually decayed at the end of each episode.
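Written out, the decay applied after each episode is:

$\epsilon = \epsilon_{\min} + (\epsilon_{\max} - \epsilon_{\min}) \, e^{-\text{decay\_rate} \times \text{episode}}$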
# Run this line to display the current state of the environment
env.render()
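To watch the trained agent, one possible evaluation loop is sketched below. It reuses the qtable, env, and max_steps defined above and always picks the greedy action; this loop is my own sketch and is not part of the excerpted code.

env.reset()
for episode in range(3):
    state = env.reset()
    done = False
    print("****** EPISODE", episode, "******")
    for step in range(max_steps):
        # Greedy action: take the biggest Q value for this state
        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        # Print the map so we can watch the agent move
        env.render()
        state = new_state
        if done:
            print("Number of steps:", step + 1)
            break
env.close()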
The agent can reach the goal in as few as 13 steps!
Today we walked through an implementation of Q-learning and saw how it actually works.
https://gym.openai.com/envs/FrozenLake-v0/
Q* Learning with FrozenLakev2.ipynb